About the project

This course is an open datascience course using Git and R. My github repository is at https://github.com/KatariinaParnanen/IODS-project


Week 2: Data wrangling, regression analysis and model validation

This week we have learned about data wrangling using R and dplyr package. We also are introduced to basic regression analysis and model validation using diagnostic plots.

Reading in data

We will first read in data from local directory. The data is based on a questionaire done on students of a course and their exam points.

#Read in table
lrn2014<-read.table("/Users/kparnane/Documents/IODS_course/IODS-project/data/learning2014.txt", header=TRUE, sep="\t")

We will explore the dataset’s dimension and structure.

#Explore the dimensions
dim(lrn2014)
## [1] 166   7
#Explore the structure of the file
str(lrn2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...

The data has 166 observations and 7 variable columns.

The dataset has a response variable “points”" and explanatory variables “age”, “attitude”, “deep”, “surf” and “stra”. “deep” means points for deep learning questions, “surf” points for surface learning questions, “stra” means strategic learning techniques. “points” means exam score. More info on the dataset can be found here.

Data exploration

We will explore correlations and distributions of explanatory and response factors using “ggpairs” which plots pairwise correlations and variable disturbutions.

#Load required packages
library(GGally)
library(ggplot2)

#Draw plot
ggpairs(lrn2014, mapping = aes(col = gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
__Pairwise correlation plot__: Exploration of variables and correlations

Pairwise correlation plot: Exploration of variables and correlations

#Get summaries of variables
summary(lrn2014)
##  gender       age           attitude         points           deep      
##  F:110   Min.   :17.00   Min.   :14.00   Min.   : 7.00   Min.   :1.583  
##  M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:19.00   1st Qu.:3.333  
##          Median :22.00   Median :32.00   Median :23.00   Median :3.667  
##          Mean   :25.51   Mean   :31.43   Mean   :22.72   Mean   :3.680  
##          3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:27.75   3rd Qu.:4.083  
##          Max.   :55.00   Max.   :50.00   Max.   :33.00   Max.   :4.917  
##       surf            stra      
##  Min.   :1.583   Min.   :1.250  
##  1st Qu.:2.417   1st Qu.:2.625  
##  Median :2.833   Median :3.188  
##  Mean   :2.787   Mean   :3.121  
##  3rd Qu.:3.167   3rd Qu.:3.625  
##  Max.   :4.333   Max.   :5.000

There are more females than males. Other factors besides age seem to roughly follow a normal distribution. Mean age is 25. We are interested in the response variable points and how the explanatory variables corrrelate with it. The highest correlations are with “attitude”, “stra” and there is a negative correlation with “surf”.

Linear models

Now we use the three variables with possible correlations with exam points identified in the data exploration in linear models.

# fit a linear model
my_model <- lm(points ~ attitude+stra+surf, data = lrn2014)

# print out a summary of the model
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

Variable “attitude” is positively correlated with points with a p-value of 1.93e-08 and estimate 0.33952. The other variables are not significantly correlated with exam points so they can be excluded from the model.

Drop the not significant variables and fit the model again.

# fit a linear model
my_model2 <- lm(points ~ attitude, data = lrn2014)

#Plot the regression line in scatter plot with exam points versus attitude
qplot(attitude, points, data = lrn2014) + geom_smooth(method = "lm")
__Scatter plot with regression line__: Relationship of points and attitude

Scatter plot with regression line: Relationship of points and attitude

Print summary of the model.

# print out a summary of the model
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude, data = lrn2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

There is a significant relationship with the variable attitude exam points. The attitude variable has a p-value of 1.95e-09, estimate of 0.35255 and standard error of 0.05674. The R-squared value is 0.1856 meaning that attitude variable explains roughly 19% of the variation in the model.

Diagnostic plots

We will produce diagnostic plots to see if the model fits the assumptions of linear models. The assumptions are that the size of the error is not dependent on the variable value and the errors are normally distributed and that the explanatory variables are not correlated with each other. The dependent variables should also be independent from each other.

# Draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
par(mfrow = c(2,2))
plot(my_model2, which=c(1, 2, 5))
__Diagnostic plots__: Plots for exploring model errors

Diagnostic plots: Plots for exploring model errors

There doesn’t seem to be any patterns in the residuals vs fitted values plot so the size of the residuals are not dependent on the fitted values. The residuals in the Q-Q plot follow the assumed regression line quite well so the assumption of normal distribution of errors is filled. The residuals vs leverage plot doesn’t reveal that there are any observations which have a unusually high effect on the model. The model also has only one explanatory variable so explanatory variable autocorrelation is not a problem. The observations are also independent from each other.Thus, we can conclude that the model fits the assumptions.


Week 3: Logistic regression

This week we have learned about logistic regression, piping and model validation.

Reading in data

Read in data and explore.

# Get access to useful libraries
library(dplyr);library(ggplot2);library(tidyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#Read in table from local folder
alc<-read.table(file = "data/alc.txt", sep="\t", header=TRUE)

#Explore the data using glimpse

glimpse(alc)
## Observations: 382
## Variables: 35
## $ school     <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,...
## $ sex        <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, ...
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address    <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, ...
## $ famsize    <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, G...
## $ Pstatus    <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, ...
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob       <fct> at_home, at_home, at_home, health, other, services,...
## $ Fjob       <fct> teacher, other, other, services, other, other, othe...
## $ reason     <fct> course, course, other, home, home, reputation, home...
## $ nursery    <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, ye...
## $ internet   <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes,...
## $ guardian   <fct> mother, father, mother, mother, father, mother, mot...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures   <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup  <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, ...
## $ famsup     <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes,...
## $ paid       <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, ...
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes,...
## $ higher     <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ romantic   <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no...
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences   <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1         <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2         <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3         <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...
# Draw a bar plot of each variable
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped

The data has 382 observations and 35 variables from students whose alcohol consumption has been studied. The variables are either integers, numeric, factors or logicals. More information about the data can be found [here]https://archive.ics.uci.edu/ml/datasets/Student+Performance.

The data includes variables for sex, absences - absences from school, goout - going out and age. Usually there are differences between how much boys and girls drink due to social norms, absences from school also probably correlate with high use of alcohol, going out a lot also usually means higher alcohol consumption due to having more exposure to situations with alcohol and age might have an effect due to easier acccess to alcohol for older kids.

Logistic regression

Logistic regression is one type of generalized linear models where the response variable can only have two values for example true or false. In this data set the response variable we are interested is high alochol use, which can have true or false values.

Explore the relationship with high alcohol use and the chosen variables and the distribution of the variables

Explore how sex is related to high alcohol use. Our hypothesis is that males drink more than females.

#Explore the distributions of females and males in the data
g0<-ggplot(data = alc, aes(x=sex))

g0 + geom_bar()

#Explore how high use is dependent on sex using barplots
g1<-ggplot(data = alc, aes(x = high_use, fill = sex))

g1 + geom_bar()+facet_wrap("sex")

There are about equal numbers of males and females in the data. There seems to be more males who have high use compared to females, which makes sense if you think about the social norms hypothesis

Explore how absences are related to alcohol use using boxplots. Our hypothesis is that kids who are absent more are more likely to also drink more.

# Explore the distribution of absences
g6<-ggplot(alc, aes(x=absences, fill =sex))
g6+geom_bar() + ylab("absences") + ggtitle("Student absences by sex")

#Draw boxplots with high and low alcohol uses relation with absences
g3 <- ggplot(alc, aes(x=high_use, y=absences, col = sex))
g3 + geom_boxplot() +  ylab("absences") + ggtitle("Student absences by alcohol consumption and sex")

Absences are not normally distributed. There is a long tail and the most common value is 0. It seems that there are more absences in the kids with high alcohol use. This makes sense if you think that kids who drink more are also skipping more school. The patterns seem similar between females and males.

Explore how going out is related to alcohol use. Our hypothesis is that kids who are going out more are exposed to alcohol more and thus will also drink more.

#Explore the distibution of kids going out
g4 <- ggplot(alc, aes(x=goout, fill=sex))
g4 + geom_bar()

#Explore the relationship with high alcohol use
g5 <- ggplot(alc, aes(x=high_use, y=goout, col = sex))
g5 + geom_boxplot()  + ggtitle("Going out by alcohol consumption and sex")

Going out seems to be quite normally distributed between the values 1 and 5. However, there aren’t very many kids who don’t go out almost at all compared to the ones who go out very often. There is approximately equal numbers of both sexes in each going out class.

There definitely seems to be a relationship between going out and alcohol use. However, there might be some differences between going out often beign correlated with drinking more since in females there is overlap between the ones who go out often but don’t have high alcohol use with those who go out often and have high alcohol use.

Explore the relationship with age and alcohol use. Our hypothesis is that it is easier for older kids to get alcohol than for younger kids.

# Explore the distribution of age
g6<-ggplot(alc, aes(x=age, fill =sex))
g6+geom_bar()

# Explore the relationship of age  with alcohol use

g7 <- ggplot(alc, aes(y=age, x=high_use, col = sex))
g7 + geom_boxplot() + ylab("age") + ggtitle("Student age by alcohol consumption and sex")

Age is not normally distributed and there is quite a long tail for higher numbers after 19. Ages range from 15 to 22 with 16 being the median. For the values 18 and up, there is an approximately equal amount of females and males for each age.

There doesn’t seem to be a very obvious link between alcohol use and age that could be seen from the boxplots since the high use boxplots overlap with the low use bowplots. It might be that it is not very difficult to get alchol in the country of the kids even when you are younger than 18.

Build model with the variables that seemed interesting

Next we will build a GLM using binomial family with the interesting explanatory variables.

# Use generalized linear models to predict high use based on selected variables

m<-glm(high_use~age+sex+absences+goout, data=alc, family = "binomial")

#Print out summary and coefficients
summary(m)
## 
## Call:
## glm(formula = high_use ~ age + sex + absences + goout, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7877  -0.8026  -0.5393   0.8294   2.5201  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.60333    1.84395  -3.039 0.002376 ** 
## age          0.08951    0.11003   0.814 0.415905    
## sexM         0.96338    0.25490   3.779 0.000157 ***
## absences     0.08205    0.02247   3.651 0.000261 ***
## goout        0.71753    0.12058   5.951 2.67e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 387.09  on 377  degrees of freedom
## AIC: 397.09
## 
## Number of Fisher Scoring iterations: 4
#From the summary we can see that sex, absences and going out are significantly correlated with high alcohol use

coef(m)
## (Intercept)         age        sexM    absences       goout 
## -5.60333157  0.08951186  0.96338497  0.08204599  0.71752626
#From the coefficients we can calculate the odds ratio by taking the exponent function of the estimate


# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI<- confint(m) %>% exp
## Waiting for profiling to be done...
# Join the two together

cbind(OR, CI)
##                      OR        2.5 %    97.5 %
## (Intercept) 0.003685565 9.268741e-05 0.1296785
## age         1.093640305 8.815211e-01 1.3582243
## sexM        2.620551969 1.599662e+00 4.3546202
## absences    1.085505735 1.040056e+00 1.1371493
## goout       2.049357367 1.626970e+00 2.6130756
  • From the odds ratio we can see that for every year the is approximately 9% increase in the chances of having high alcohol use, but the confidence interval includes one, which means that age is not significant.

  • We can also see that sex male means a 2.6 times more likely chance of having high alcohol use. Female is the intercept factor. The CI does not include 1 so this going out is also significant.

  • One more absence from school causes an approximately 9% bigger chance of having high use and this is significant also the CI doesn’t include 1.

  • Going out from a scale to 1 to 5 is significantly correlated with high alcohol use as the CI doesn’t include 1. An increasement of 1 causes an approximately 2.6 more likely chance of having high use.

Cross validation

After obtaining the model, we will use cross validation with a training set to estimate the error rate of the model. We will use the model without age since age was not significant.

# Fit the model without age
m<-glm(high_use~sex+absences+goout, data=alc, family = "binomial")


# Predict the probability of high alcohol use
probabilities <- predict(m, type = "response")

# Add the predicted probabilities to the alc table
alc <- mutate(alc, probability = probabilities)

# Use the probabilities to make a prediction of high alcohol use as probabilitites instead of odds
alc <- mutate(alc, prediction = probabilities>0.5)

# See the last ten original classes, predicted probabilities, and class predictions
select(alc, failures, absences, sex, high_use, probability, prediction) %>% tail(10)
##     failures absences sex high_use probability prediction
## 373        1        0   M    FALSE  0.14869987      FALSE
## 374        1        7   M     TRUE  0.39514446      FALSE
## 375        0        1   F    FALSE  0.13129452      FALSE
## 376        0        6   F    FALSE  0.18714923      FALSE
## 377        1        2   F    FALSE  0.07342805      FALSE
## 378        0        2   F    FALSE  0.25434555      FALSE
## 379        2        2   F    FALSE  0.07342805      FALSE
## 380        0        3   F    FALSE  0.03989428      FALSE
## 381        0        4   M     TRUE  0.68596604       TRUE
## 382        0        2   M     TRUE  0.09060457      FALSE
# From here you can see than when prob is >0.5 the prediction is "TRUE"

# Cross tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.66230366 0.03926702 0.70157068
##    TRUE  0.17015707 0.12827225 0.29842932
##    Sum   0.83246073 0.16753927 1.00000000
# From here we can see that out of 65/318 predictions are wrong for predicted low alcohol use and 15/64 are wrong for the prediction of high alcohol use.

# Define a loss function to get the mean prediction error
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# Call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2094241
# Do K-fold cross-validation using the loss_func defined previously
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# Average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2120419

There is approximately 22 % error in the prediction of high alcohol use using the model wiht going out, absences and age.


Week 4: Clustering and classification

This week we have learned about clustering, which is a handy tool for visualizing differences between data points. Clustering can be used for example for data exploration. We also learned about classification, which can be used to validate the results of clustering. Discriminant analysis was also touched upon.

Loading in the data and exploring it

We will load the dataset “Boston” from MASS package. The dataset includes data on crime rates per capita for towns in Boston with explanatory variables related to land use and inhabitant demographics.

# Access the MASS package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(dplyr)

# Load the data
data("Boston")

# Explore the dataset's dimentions and details of the variables
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Explore dataset with corrplot

#L
library(corrplot)
## corrplot 0.84 loaded
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston)%>%round(digits=2) 

# print the correlation matrix
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex=0.6)

There seems to be highest correlations with rad = index of accessibility to radial highways, tax=full-value property-tax rate per $10,000, lsat=lower status of the population (percent) and indus=proportion of non-retail business acres per town variables and lowest with medv=median value of owner-occupied homes in $1000s, black=1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town and dis=weighted mean of distances to five Boston employment centres. Some of the variables have only very few high or low values which might cause good correlations, but the correlations might not be significant.

Scaling the variables

Next we will scale the variables using the scale function.

# Center and standardize variables
boston_scaled <- scale(Boston)

# Summaries of the scaled variables
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# Class of the boston_scaled object is matrix
class(boston_scaled)
## [1] "matrix"
# Change the matrix to a data frame
boston_scaled<-as.data.frame(boston_scaled)

# Summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# Create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# Create a categorical variable 'crime' using the quantiles as break points
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=c("low", "med_low", "med_high", "high"))

# Look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# Remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# Add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

The values are changed to distances from mean values (centering) and scaling by dividing the centered columns by their standard deviations.

Create training and test sets

Next we will create training and test sets from the data. We will use 80% of the data for training and then use the rest for testing our model.

# Number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# Choose  80% of the rows randomly
ind <- sample(n,  size = n * 0.8)

# Create train set
train <- boston_scaled[ind,]

Linear discriminant analysis

Next we will fit a linear disriminant analysis for the data using the lda() command with all the explanatory variables.

# Linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2475248 0.2599010 0.2524752 0.2400990 
## 
## Group means:
##                   zn      indus        chas        nox         rm
## low       0.91934427 -0.9176465 -0.15421606 -0.8550746  0.4900473
## med_low  -0.09767078 -0.2730274 -0.04735191 -0.5540392 -0.1585990
## med_high -0.39307863  0.1795891  0.22945822  0.4053551  0.1071106
## high     -0.48724019  1.0172187 -0.06938576  1.0285090 -0.4238136
##                 age        dis        rad        tax    ptratio      black
## low      -0.8407054  0.8151475 -0.6936060 -0.7262326 -0.3970232  0.3896097
## med_low  -0.3389343  0.3168673 -0.5509226 -0.4704292 -0.0494064  0.3162580
## med_high  0.4402226 -0.3734726 -0.4076377 -0.3047673 -0.2996246  0.1063285
## high      0.7927422 -0.8542498  1.6371072  1.5133254  0.7795879 -0.8579791
##                lstat        medv
## low      -0.77633602  0.55889263
## med_low  -0.12652733 -0.02520945
## med_high  0.04402535  0.16784243
## high      0.84339141 -0.68004621
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.07448495  0.66504070 -0.85566195
## indus    0.05275595 -0.22841003  0.43845619
## chas    -0.08676219 -0.10296902  0.08794376
## nox      0.36365407 -0.78109337 -1.37268666
## rm      -0.14817813 -0.09301656 -0.24698636
## age      0.26769510 -0.32788474 -0.17396315
## dis     -0.01794409 -0.30329234  0.04827372
## rad      3.22414317  0.82169610  0.04397460
## tax     -0.04574639  0.08478246  0.34117729
## ptratio  0.09153638  0.02385299 -0.31646042
## black   -0.14964951  0.01634448  0.12435279
## lstat    0.17341707 -0.23414633  0.24093020
## medv     0.20269644 -0.35522444 -0.28033194
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9482 0.0379 0.0139

From were we see that LDA1 explains >90% of the variance, which means that the model explains a great deal of the variation in the data. In the summary we can also see the means of the explanatory variables for different crime classes. For example rad and tax have quite big differences in their means between the low and high crime rates.

Drawing the LDA biplot

Next we will draw a plot and see how the different towns group in clustering. We will color and annotate the dots with crime rates and draw arrows for directions and weights of the different explanatory variables.

# The function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# Target classes as numeric
classes <- as.numeric(train$crime)

# Plot the lda results
plot(lda.fit, dimen = 2,col = classes, pch=classes)
lda.arrows(lda.fit, myscale = 2)

From the plot it seems that rad (accessibility to radial highways) has the biggest impact on grouping. There are two clusters in the plot with one of them having almost al the highest crime rate towns and some medium high rates but none of the medium low or low crime rate towns. Seems like the model might only need rad to predict the crime rates.

Test set

Next we will validate the results using the test set we made earlier. We will predict crime rate classes with test data using the lda.fit model we created with the training set.

# Create test set 
test <- boston_scaled[-ind,]

# Save the correct classes from test data
correct_classes <- test$crime

# Remove the crime variable from test data
test <- dplyr::select(test, -crime)

## Predict classes with test data using the lda.fit model we created with the training set.
lda.pred <- predict(lda.fit, newdata = test)

# Cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       17      10        0    0
##   med_low    4      14        3    0
##   med_high   1       5       17    1
##   high       0       0        0   30

From the cross tablulation, we can see that 20/20 high crime rates were predicted correctly to be high. However, two med_high cases were incorrectly classified as high. This is pretty good! For the low class, all of the correct lows were predicted to be lows, but 12 were predicted as med_low and one as meg_high.

The model strugles with correct classification of the med_low and med_high classes. Only six of the med_high classes was correctly classified and 20/26 were predicted to be some other class. Most were predicted to be med_low, but two were high and one low.

The model does a bit better with the med_low class. 17/28 are correctly classified as med_low, and five as low and 6 as med_high.

Predicting the best number of clusters in a dataset

Next we will reload the Boston dataset, scale it and use k-means algorithm to predict the optimal number of clusters in the dataset.

# Obtain dataset
data('Boston')

# Change the matrix to a data frame
boston_scaled2<-as.data.frame(Boston)

# K-means clustering
km <-kmeans(boston_scaled2, centers = 3)

# Plot the Boston dataset with clusters, use first seven variables
pairs(boston_scaled2[1:14], col = km$cluster)

The variables rad and tax seem to cluster the data pretty nicely into two clusters in the crime vs rad and crime vs tax plots. Three clusters might not be needed, but two might suffice. Let’s check this next.

Determining the optimal number of clusters

We will determine the number of clusters is to look by how the total of within cluster sum of squares (WCSS) behaves when the number of cluster changes. The best number of clusters is when the total WCSS drops dramatically, which can be observed in the plot.

# MASS, ggplot2 and Boston dataset are available
set.seed(123)

# Determine the maximun number of clusters
k_max <- 10

# Calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})

# Visualize the results with qplot
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering
km <-kmeans(boston_scaled2, centers = 2)

# Plot the Boston dataset with clusters
pairs(boston_scaled2, col = km$cluster)

Like we observed in the previous pairs plot, based on the qplot and WCSS it seems that the optimal number of clusters is 2. Rad and tax produce the best separation betweem the classes.